5 - Optimization

In this lecture note, we cover the basics of optimization in high dimensions. In particular, we introduce derivative tests for multivariate functions, review properties such as smoothness and convexity, and analyze standard methods such as gradient descent and the stochastic gradient method.

5.1 Unconstrained Optimization and High-Dimensional Derivatives

We will write our optimization problems in the form

$$ \min_{\mathbf{w} \in \mathbb{R}^d} f(\mathbf{w}). $$

This statement should be interpreted as wishing to find the minimum value of a multivariate function $f$ over all possible vectors of variables $\mathbf{w}.$ We will refer to the function $f$ as the objective or loss function. For the problems in data science we consider, $\mathbf{w}$ will typically correspond to a parameter (or weight) vector that determines the model that we use, while $f$ will typically have the interpretation of the average error (or mismatch) between the learning model and the data.

To state the algorithms we will use and even to check if any vector is a (local) minimum of $f$, we need to know how to work with high-dimensional derivatives of $f$. Fortunately, these are not hard to define and we will only look at the first and the second derivative. The first derivative of $f$ is a $d$-dimensional vector called the gradient and is denoted by $\nabla f(\mathbf{w})$. It is a vector that simply stacks all partial derivatives of $f$:

$$ \nabla f(\mathbf{w}) = \begin{bmatrix}\frac{\partial f}{\partial w_1}(\mathbf{w})\\ \frac{\partial f}{\partial w_2}(\mathbf{w})\\ \vdots\\ \frac{\partial f}{\partial w_d}(\mathbf{w}) \end{bmatrix}. $$

Similarly, the second-order derivative is a matrix that we call the Hessian matrix and denote by $\nabla^2 f(\mathbf{w})$. It is defined by

$$ \nabla^2 f(\mathbf{w}) = \begin{bmatrix}\frac{\partial^2 f}{\partial w_1^2}(\mathbf{w}) & \frac{\partial^2 f}{\partial w_1 \partial w_2}(\mathbf{w}) &\dots & \frac{\partial^2 f}{\partial w_1 \partial w_d}(\mathbf{w})\\ \frac{\partial^2 f}{\partial w_2 \partial w_1}(\mathbf{w}) & \frac{\partial^2 f}{\partial w_2^2}(\mathbf{w}) & \dots & \frac{\partial^2 f}{\partial w_2 \partial w_d}(\mathbf{w})\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial w_d \partial w_1}(\mathbf{w}) & \frac{\partial^2 f}{\partial w_d \partial w_2}(\mathbf{w}) & \dots & \frac{\partial^2 f}{\partial w_d^2}(\mathbf{w}) \end{bmatrix}. $$

In this class, whenever we work with a function that has a Hessian, it will be a symmetric matrix, meaning that the entries $ij$ and $ji$ are the same for any $i, j \in \{1, 2, \dots, d\}.$ For this property to hold, it is generally required that the second partial derivatives are continuous.

5.2 Local and Global Minima, High-Dimensional Derivative Test

Similar to the univariate case, we say that some vector $\mathbf{w}_*$ is a local minimizer (or local minimum) of $f$ if there exists a radius $r > 0$ such that $f(\mathbf{w}_*) \leq f(\mathbf{w})$ for all $\mathbf{w}$ with $\|\mathbf{w} - \mathbf{w}_*\|_2 \leq r$, and a global minimizer (or global minimum) of $f$ if $f(\mathbf{w}_*) \leq f(\mathbf{w})$ for all $\mathbf{w} \in \mathbb{R}^d.$

Sometimes even finding an approximate local minimizer of $f$ is too much to ask for: we very quickly run into computational intractability issues. However, under mild assumptions about $f$, we can guarantee finding a vector $\mathbf{w}$ with small gradient norm $\|\nabla f(\mathbf{w})\|_2.$ We refer to points with zero gradient as stationary points. Stationary points can be local maxima, local minima, or neither. Example illustrations are provided below.

You will recall from your calculus classes (and maybe some of our past lectures), that every local minimum $w_*$ of a univariate function $f$ needs to satisfy $f'(w_*) = 0$ and $f''(w_*) \geq 0$ (necessary condition). On the other hand, if a point $w_*$ is such that $f'(w_*) = 0$ and $f''(w_*) > 0$, then $w_*$ must be a (strict) local minimizer of $f$ (sufficient condition). There is a similar derivative test that can be applied to multivariate functions, but now because the second derivative is a matrix, we need to be careful with what we mean by the second derivative being non-negative or positive (for the first derivative, the condition is simply $\nabla f(\mathbf{w}_*) = \mathbf{0}$). It turns out that the 'correct' notion of non-negativity for the Hessian matrix is that it is positive semi-definite, which is something you may have seen in your linear algebra classes. We will say what this means more precisely below.

Necessary Conditions: If $\mathbf{w}_* \in \mathbb{R}^d$ is a local minimizer of a (twice continuously differentiable) function $f$, then it must satisfy:

$$ \nabla f(\mathbf{w}_*) = \mathbf{0} \;\; \text{ and } \;\; \mathbf{v}^\top \nabla^2 f(\mathbf{w}_*) \mathbf{v} \geq 0, \; \forall \mathbf{v} \in \mathbb{R}^d. $$

In other words, the gradient at $\mathbf{w}_*$ must be the zero vector and the Hessian at $\mathbf{w}_*$ must be a positive semi-definite matrix.

Sufficient Conditions: If $\mathbf{w}_* \in \mathbb{R}^d$ satisfies

$$ \nabla f(\mathbf{w}_*) = \mathbf{0} \;\; \text{ and } \;\; \mathbf{v}^\top \nabla^2 f(\mathbf{w}_*) \mathbf{v} > 0, \; \forall \mathbf{v} \in \mathbb{R}^d \; \text{ s.t. }\; \mathbf{v} \neq \mathbf{0} , $$

then $\mathbf{w}_*$ must be a (strict) local minimizer of $f.$

In other words, if the gradient of $f$ at $\mathbf{w}_*$ is the zero vector and the Hessian of $f$ at $\mathbf{w}_*$ is positive definite, then $\mathbf{w}_*$ must be a (strict) local minimizer of $f.$

It can be quite difficult to determine by hand whether a symmetric $d \times d$ matrix is positive (semi)definite. However, an equivalent characterization of such matrices is via eigenvalues: a matrix is positive semi-definite if and only if its smallest eigenvalue is non-negative. Similarly, a matrix is positive definite if and only if its smallest eigenvalue is (strictly) positive. There are multiple packages & functions in Python (like numpy.linalg.eig and scipy.sparse.linalg.eigs) that can compute the minimum eigenvalue (or even the entire eigen decomposition) of a symmetric matrix, so if you have a candidate vector $\mathbf{w}_*$ for a function $f$ that is twice (continuously) differentiable, you can perform the derivative test.
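To make the test concrete, here is a minimal sketch (the function below is a hypothetical example chosen for illustration, not one from the lecture) that checks both conditions at a candidate point using `numpy.linalg.eigvalsh`, which returns the eigenvalues of a symmetric matrix in ascending order:

```python
import numpy as np

# A minimal sketch of the eigenvalue-based derivative test, applied to the
# hypothetical function f(w) = w_1^4 + w_1^2 + (w_1 - w_2)^2 (chosen only for
# illustration) at the candidate point w_* = (0, 0).
def grad_f(w):
    return np.array([4 * w[0]**3 + 2 * w[0] + 2 * (w[0] - w[1]),
                     -2 * (w[0] - w[1])])

def hess_f(w):
    return np.array([[12 * w[0]**2 + 4.0, -2.0],
                     [-2.0,               2.0]])

w_star = np.zeros(2)
g = grad_f(w_star)
eigenvalues = np.linalg.eigvalsh(hess_f(w_star))  # eigenvalues in ascending order

print("gradient at w_*:", g)                      # zero vector => stationary point
print("smallest Hessian eigenvalue:", eigenvalues[0])
if np.allclose(g, 0.0) and eigenvalues[0] > 0:
    print("w_* passes the sufficient condition: it is a (strict) local minimizer")
```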

There is also a simple class of matrices that are guaranteed to be positive semi-definite: matrices $\mathbf{A}$ that can be written as $\mathbf{A} = \mathbf{B}^\top \mathbf{B}$ for some matrix $\mathbf{B}.$ In that case, for any vector $\mathbf{v}$ we have

$$ \mathbf{v}^\top \mathbf{A}\mathbf{v} = \mathbf{v}^\top \mathbf{B}^\top \mathbf{B}\mathbf{v} = (\mathbf{B}\mathbf{v})^\top \mathbf{B}\mathbf{v} = \|\mathbf{B}\mathbf{v}\|_2^2 \geq 0. $$

An example of a matrix that we can write in this form is the covariance matrix $\boldsymbol{\Sigma} = \mathbb{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top]$ of any random vector $\mathbf{x}$ with mean $\mathbb{E}[\mathbf{x}] = \boldsymbol{\mu}$.

Another simple example would be a diagonal matrix: a matrix whose only non-zero elements are on its diagonal. Then it is known that in this case the entries on the diagonal are the eigenvalues of the matrix (with standard basis vectors as the eigenvectors), and we can simply check if the matrix is positive (semi)definite by looking at whether all the eigenvalues are positive (or non-negative).

5.3 Convexity

You may be wondering if there are functions for which we can guarantee that any point that is a local minimum is also a global minimum. It turns out that there is a very important class of functions that satisfies such a property, called convex functions (they satisfy even more: any stationary point is a global minimum). This is not the largest class of functions with the property that all local minima are global, but it is perhaps the most well-studied one.

We say that a function $f$ is convex if for any two vectors $\mathbf{w}_1$ and $\mathbf{w}_2$ and any $\alpha \in (0, 1)$ we have

$$ f(\alpha \mathbf{w}_1 + (1-\alpha)\mathbf{w}_2) \leq \alpha f(\mathbf{w}_1) + (1 - \alpha) f(\mathbf{w}_2). $$

An immediate consequence of convexity is the very useful Jensen's inequality, which states that if $\mathbf{w}$ is a random vector drawn from some distribution $\mathcal{D}$ and $f$ is a convex function, then

$$ f(\mathbb{E}_{\mathbf{w}\sim \mathcal{D}}[\mathbf{w}]) \leq \mathbb{E}_{\mathbf{w}\sim \mathcal{D}}[f(\mathbf{w})]. $$

The code below provides an illustration of Jensen's inequality.
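(The original snippet is not reproduced here; the following is a minimal stand-in that checks the inequality numerically, with the convex function $f(\mathbf{w}) = \|\mathbf{w}\|_2^2$ and a standard normal distribution chosen only for illustration.)

```python
import numpy as np

# A minimal numerical check of Jensen's inequality. Illustrative assumptions:
# f(w) = ||w||_2^2 (convex) and w ~ N(0, I_3); expectations are approximated
# by sample averages.
rng = np.random.default_rng(0)
f = lambda w: np.sum(w**2, axis=-1)

samples = rng.standard_normal(size=(100_000, 3))   # draws of the random vector w
f_of_mean = f(samples.mean(axis=0))                 # f(E[w]), approximately f(0) = 0
mean_of_f = f(samples).mean()                       # E[f(w)], approximately 3

print(f"f(E[w]) = {f_of_mean:.4f}  <=  E[f(w)] = {mean_of_f:.4f}")
```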

You may have also heard of concave functions. These are simply the functions $g$ such that $f(\cdot) = -g(\cdot)$ is convex.

In this course, we will mainly be concerned with functions that are differentiable (in fact, almost any function you are likely to encounter will be differentiable almost everywhere, so we are not losing much by assuming this). Convex functions that are differentiable (meaning their gradient exists) can be equivalently defined as functions that satisfy the following for all vectors $\mathbf{w}_1, \mathbf{w}_2:$

$$ f(\mathbf{w}_1) \geq f(\mathbf{w}_2) + \nabla f(\mathbf{w}_2)^\top (\mathbf{w}_1 - \mathbf{w}_2). $$

Geometrically, this means that the tangent to a convex function (its linear approximation) lies below it everywhere. In one dimension, this is illustrated by the Python code snippet below. In two dimensions, the linear approximation would be a (lower-bounding) plane touching (but otherwise not intersecting) the function; in higher dimensions these are called lower-bounding hyperplanes. They correspond to the right-hand side of the inequality above, i.e., the function $f(\mathbf{w}_2) + \nabla f(\mathbf{w}_2)^\top (\mathbf{w}_1 - \mathbf{w}_2)$ that treats $\mathbf{w}_2$ as a fixed vector and $\mathbf{w}_1$ as its vector of variables. So for any fixed $\mathbf{w}_2,$ a convex function $f$ lies above this linear approximation at every point $\mathbf{w}_1.$
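(A minimal stand-in for that snippet, assuming the simple convex function $f(w) = w^2$ and the point $w_2 = 1$ for illustration:)

```python
import numpy as np
import matplotlib.pyplot as plt

# A minimal stand-in illustration: for the convex function f(w) = w^2, the
# linear approximation around w_2 = 1 lies below the function everywhere.
f = lambda w: w**2
grad_f = lambda w: 2 * w

w2 = 1.0
w = np.linspace(-2.0, 3.0, 400)
linear_approx = f(w2) + grad_f(w2) * (w - w2)   # f(w_2) + f'(w_2)(w - w_2)

print("linear approximation <= f everywhere:",
      bool(np.all(linear_approx <= f(w) + 1e-12)))

plt.plot(w, f(w), label="f(w) = w^2")
plt.plot(w, linear_approx, "--", label="linear approximation at w_2 = 1")
plt.legend()
plt.show()
```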

If a function is further twice continuously differentiable (meaning that the Hessian exists everywhere), then it is convex if and only if its Hessian $\nabla^2 f(\mathbf{w})$ is positive semi-definite at every vector $\mathbf{w}$ (compare this to the derivative test we discussed earlier).

5.4 Some Regularity Properties of Objective Functions

In general, convex or not, we do not know how to minimize arbitrary functions (and there are computational hardness results that prevent us from doing so) $-$ we always need to make some regularity assumptions about $f$. One such regularity assumption that is commonly used, and satisfied in problems such as linear regression and logistic regression, is called smoothness. A function is said to be smooth if its gradient is Lipschitz continuous, meaning that there exists some positive constant $L$ such that for any two vectors $\mathbf{w}_1, \mathbf{w}_2$ we have

$$ \|\nabla f(\mathbf{w}_1) - \nabla f(\mathbf{w}_2)\|_2 \leq L \|\mathbf{w}_1 - \mathbf{w}_2\|_2. $$

In other words, if we change the vector $\mathbf{w}$ by a little bit, the gradient of the function cannot change too much. If we can write the above inequality for some concrete value $L$, we will also say that the function is $L$-smooth. All smooth functions will satisfy the following inequality, which we will end up using in the analysis of algorithms: for all $\mathbf{w}_1, \mathbf{w}_2$,

$$ f(\mathbf{w}_1) \leq f(\mathbf{w}_2) + \nabla f(\mathbf{w}_2)^\top (\mathbf{w}_1 - \mathbf{w}_2) + \frac{L}{2}\|\mathbf{w}_1 - \mathbf{w}_2 \|_2^2. $$

All the functions we have seen in Python illustrations so far are smooth.

An example of a function that is continuous but not smooth is provided below. You can easily check that at the "kinks" where the different linear pieces join (at $x = 0$ and $x = 2$), the derivative is discontinuous.
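(The lecture's own example is not reproduced here; as a hypothetical stand-in with kinks at the same points, consider $f(x) = |x| + |x - 2|$:)

```python
import numpy as np
import matplotlib.pyplot as plt

# A hypothetical stand-in example: f(x) = |x| + |x - 2| is Lipschitz continuous
# (bounded slope) but not smooth -- its derivative jumps at x = 0 and x = 2.
f = lambda x: np.abs(x) + np.abs(x - 2)
df = lambda x: np.sign(x) + np.sign(x - 2)   # derivative away from the kinks

x = np.linspace(-2, 4, 600)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.plot(x, f(x)); ax1.set_title("piecewise-linear f (kinks at 0 and 2)")
ax2.plot(x, df(x)); ax2.set_title("derivative of f (discontinuous)")
plt.show()
```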

We will sometimes assume that, instead of being smooth, a function has a bounded gradient norm at every point, meaning that there exists some positive number $M$ such that $\|\nabla f(\mathbf{w})\|_2 \leq M$ for all $\mathbf{w} \in \mathbb{R}^d.$ This property is satisfied by functions that are Lipschitz continuous, meaning that for any two vectors $\mathbf{w}_1$ and $\mathbf{w}_2$ we have

$$ |f(\mathbf{w}_1) - f(\mathbf{w}_2)| \leq M\|\mathbf{w}_1 - \mathbf{w}_2\|_2. $$

The piecewise linear function illustrated above is Lipschitz continuous (has bounded slope).

5.5 Gradient Descent

Gradient descent is one of the most basic optimization algorithms and is also quite natural (so much so that it was discovered by Cauchy in 1847; see also https://www.math.uni-bielefeld.de/documenta/vol-ismp/40_lemarechal-claude.pdf): the idea is to move in the direction opposite to the gradient, which you can think of as going down a hill. Specifically, we start from an arbitrary vector $\mathbf{w}_0$ (e.g., $\mathbf{w}_0 = \mathbf{0}$) and update the (candidate) solution vector using the negative direction of the gradient at each iteration $t \geq 0$:

$$ \mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \nabla f(\mathbf{w}_t). $$

This is an iterative method and we usually refer to the running solution vectors $\mathbf{w}_t$ as the "iterates" of the algorithm.
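As a minimal sketch, the update rule can be implemented in a few lines; the quadratic objective below is an illustrative assumption, not a function from the lecture:

```python
import numpy as np

def gradient_descent(grad_f, w0, eta, num_iters):
    """Run w_{t+1} = w_t - eta * grad_f(w_t) and return all iterates."""
    w = np.asarray(w0, dtype=float)
    iterates = [w.copy()]
    for _ in range(num_iters):
        w = w - eta * grad_f(w)
        iterates.append(w.copy())
    return iterates

# Illustrative objective: f(w) = 0.5 * ||A w - b||_2^2 with gradient A^T (A w - b).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
grad_f = lambda w: A.T @ (A @ w - b)

L = np.linalg.eigvalsh(A.T @ A).max()       # smoothness constant of this objective
iterates = gradient_descent(grad_f, w0=np.zeros(2), eta=1.0 / L, num_iters=100)
print("final iterate:", iterates[-1])       # approaches the solution of A w = b
```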

Very little effort is needed to justify such an approach: if a function is sufficiently "regular" (for example, if it is smooth), we can guarantee a reduction in the function value. So this seems like a reasonable approach and one that would at least avoid local maxima (provided we did not start at one). However, unless the function is convex (or in some sense close to convex), there is no guarantee of landing at a local minimum: the best we can hope to converge to is a stationary point. We will formally argue this in what follows.

5.5.1 Sufficient Descent and Convergence to a Stationary Point

The basic version of the gradient descent method applies when the function is $L$-smooth for some constant $L > 0.$ In this case, the method is also called "steepest descent" and we can argue that if we choose the step size to be "not too large," then we are guaranteed to decrease the value of the function. Note that for this basic claim we do not require that the function is convex; only that it is smooth.

To formally argue this, all that we need is the inequality that holds for all $L$-smooth functions that we stated earlier:

$$ f(\mathbf{w}') \leq f(\mathbf{w}) + \nabla f(\mathbf{w})^\top(\mathbf{w}' - \mathbf{w}) + \frac{L}{2}\|\mathbf{w}' - \mathbf{w}\|_2^2, \;\;\; \forall \mathbf{w}, \mathbf{w}' \in \mathbb{R}^d. $$

In particular, plugging $\mathbf{w} = \mathbf{w}_t$ and $\mathbf{w}' = \mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \nabla f(\mathbf{w}_t)$ into the above inequality, we get that

$$ f(\mathbf{w}_{t+1}) \leq f(\mathbf{w}_t) + \nabla f(\mathbf{w}_t)^\top(-\eta_t \nabla f(\mathbf{w}_t)) + \frac{L}{2}\|-\eta_t \nabla f(\mathbf{w}_t)\|_2^2 = f(\mathbf{w}_t) - \Big(\eta_t - \frac{\eta_t^2 L}{2}\Big)\|\nabla f(\mathbf{w}_t)\|_2^2. $$

Thus, if we choose $\eta_t$ so that $\eta_t - \frac{\eta_t^2 L}{2} > 0$, which, solving for $\eta_t$, is the same as $0 < \eta_t < \frac{2}{L},$ we are guaranteed to decrease the function value (that is, get $f(\mathbf{w}_{t+1}) < f(\mathbf{w}_t)$) as long as $\nabla f(\mathbf{w}_t) \neq \mathbf{0}.$ But if the gradient is the zero vector, we have converged to a stationary point! Now, this point may or may not be a local minimum, but it turns out that under the assumptions we are making, this is the best we can hope to get, and in many applications having a point with zero (or small) gradient is good enough.

To be more specific, let us take the step size to be constant $\eta_t = \eta$ and such that $0 < \eta \leq \frac{1}{L}$ (for example, we can take $\eta = 1/L$). Then the inequality we have derived above leads to what is known as the "sufficient decrease" or "sufficient descent" property of gradient descent:

$$ f(\mathbf{w}_{t+1}) - f(\mathbf{w}_t) \leq - \Big(\eta - \frac{\eta^2 L}{2}\Big)\|\nabla f(\mathbf{w}_t)\|_2^2 \leq - \frac{\eta}{2}\|\nabla f(\mathbf{w}_t)\|_2^2, $$

where we used $\eta \leq \frac{1}{L}$ and so $\frac{\eta^2 L}{2} \leq \eta \cdot \frac{1}{L}\cdot \frac{L}{2} = \frac{\eta}{2}.$

The sufficient decrease property above can be used to argue how fast we can make the gradient small; i.e., to get a rate of convergence. We only need the mild assumption that the function $f$ is bounded below, meaning that there exists some $f_* > -\infty$ such that $f(\mathbf{w}) \geq f_*$ for all vectors $\mathbf{w} \in \mathbb{R}^d$. Combined with smoothness, this assumption guarantees that there exist points with arbitrarily small gradient norm (although an exact stationary point need not exist). Without such an assumption we cannot say much: consider the univariate function $f(w) = w;$ its derivative is equal to 1 everywhere, so it cannot be made arbitrarily small. This function is not bounded below $-$ it does not have a minimizer on the real line; if we were to try to minimize it, we would keep decreasing $w$ towards $-\infty$.

The sufficient decrease property we have derived for gradient descent holds no matter where we start from and for any iteration $t$. If we look at any iteration $t \geq 1$ and apply the sufficient decrease property recursively, we get

\begin{align*} f(\mathbf{w}_{t+1}) &\leq f(\mathbf{w}_t) - \frac{\eta}{2}\|\nabla f(\mathbf{w}_t)\|_2^2\\ &\leq f(\mathbf{w}_{t-1}) - \frac{\eta}{2}\|\nabla f(\mathbf{w}_{t-1})\|_2^2 - \frac{\eta}{2}\|\nabla f(\mathbf{w}_t)\|_2^2\\ &\;\vdots\\ &\leq f(\mathbf{w}_0) - \frac{\eta}{2}\|\nabla f(\mathbf{w}_0)\|_2^2 - \frac{\eta}{2}\|\nabla f(\mathbf{w}_1)\|_2^2 - \dots - \frac{\eta}{2}\|\nabla f(\mathbf{w}_t)\|_2^2. \end{align*}

Grouping all the gradient norm squared terms into a summation and rearranging the last inequality, we equivalently have:

$$ \sum_{k=0}^t \|\nabla f(\mathbf{w}_k)\|_2^2 \leq \frac{2}{\eta}(f(\mathbf{w}_0) - f(\mathbf{w}_{t+1})). $$

Since the function is bounded below, we also have $f(\mathbf{w}_{t+1}) \geq f_*$ and so

$$ \sum_{k=0}^t \|\nabla f(\mathbf{w}_k)\|_2^2 \leq \frac{2}{\eta}(f(\mathbf{w}_0) - f_*). $$

Now we have something that increases with $t$ on the left-hand side (all summation terms are non-negative, and strictly positive unless we are already at a stationary point), while the right-hand side is a fixed finite quantity, independent of $t$. This tells us that the gradient norm cannot stay large forever $-$ it must go to zero as $t$ tends to infinity. But we can say even more. If we divide both sides by $t+1$, on the left-hand side we get the average squared gradient norm, while the right-hand side approaches zero at rate $\frac{1}{t+1}:$

$$ \frac{1}{t+1}\sum_{k=0}^t \|\nabla f(\mathbf{w}_k)\|_2^2 \leq \frac{\frac{2}{\eta}(f(\mathbf{w}_0) - f_*)}{t+1}. $$

The gradient norms in the above inequality are just some numbers. The minimum of a set of numbers, as you know, cannot be larger than their average. So we can also write that

$$ \min_{0\leq k \leq t}\|\nabla f(\mathbf{w}_k)\|_2^2 \leq \frac{\frac{2}{\eta}(f(\mathbf{w}_0) - f_*)}{t+1}. $$

In other words, within the first $t$ iterations, the smallest gradient (the one with the smallest magnitude/norm) that we see is guaranteed to have squared norm at most $\frac{\frac{2}{\eta}(f(\mathbf{w}_0) - f_*)}{t+1}$. Put differently, if we want to find a solution point $\mathbf{w}$ such that $\|\nabla f(\mathbf{w})\|_2 \leq \epsilon$ for some target error $\epsilon > 0,$ then by running gradient descent and taking the vector $\mathbf{w}_k$, $0\leq k \leq t,$ with the minimum gradient norm, we are guaranteed to have $\|\nabla f(\mathbf{w}_k)\|_2 \leq \epsilon$ within $t = \lceil \frac{\frac{2}{\eta}(f(\mathbf{w}_0) - f_*)}{\epsilon^2} \rceil$ iterations.
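As a sanity check, the sketch below verifies this guarantee numerically on an illustrative $1$-smooth function, $f(w) = 1 - \cos(w)$, which is bounded below by $f_* = 0$ (these choices are assumptions made for the sketch, not examples from the lecture):

```python
import numpy as np

# Numerical check of the min-gradient-norm guarantee on f(w) = 1 - cos(w),
# which is 1-smooth and bounded below by f_* = 0 (illustrative assumptions).
f = lambda w: 1.0 - np.cos(w)
grad_f = lambda w: np.sin(w)

L, eta, f_star = 1.0, 1.0, 0.0           # step size eta = 1/L
w0 = 2.0
w, grad_sq = w0, []
for _ in range(50):
    grad_sq.append(grad_f(w) ** 2)        # record ||grad f(w_t)||^2 before the update
    w = w - eta * grad_f(w)

t = len(grad_sq) - 1
bound = (2.0 / eta) * (f(w0) - f_star) / (t + 1)
print(f"min squared gradient norm: {min(grad_sq):.3e}  <=  bound: {bound:.3e}")
```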

You may be wondering about this result and whether we can improve it in some ways. In general $-$ we cannot. In particular, if we are only accessing the function's gradient, then this is the best we can do in terms of the number of iterations to output a vector $\mathbf{w}$ with $\|\nabla f(\mathbf{w})\|_2 \leq \epsilon$ in the worst case. We cannot say whether such a vector would be an approximate local minimum or an approximate saddle point (neither a local minimum nor a local maximum), but we know it cannot be an approximate local maximum (unless we started exactly at a local maximum), as we have a descent method: each step reduces the function value and to get to a local max, we would need to increase it ("climb the hill"). If the function is twice (Lipschitz continuously) differentiable, then we can compute its Hessian and check whether our solution looks like an approximate local minimum (based on the eigenvalues of the Hessian) and potentially keep moving if it is not an approximate local minimum (think about going down the side of the saddle we saw in one of the illustrations earlier in the lecture). It is possible to analyze such an approach, but it goes beyond what we cover in this class.

Finally, you may be wondering whether we can get a guarantee on the gradient of the last iterate $\mathbf{w}_t$ as opposed to the "best" one (the one with the smallest gradient). The answer is unfortunately "no," and this can be seen even in one-dimensional examples. Consider the example illustrated in the Python code snippet below. We have a nonconvex function, which looks mostly concave in the region we consider. The iterates of gradient descent are indicated by the red triangles, with increasing iteration count corresponding to the triangles going from left to right. In this one-dimensional example, the magnitude of the gradient (the one-dimensional derivative) corresponds to how steep the slope of the function is (see also the right plot, which shows the magnitude of the gradient). At the left-most triangle, where we initialized the algorithm, the function is mostly flat, meaning that the slope (and thus the gradient magnitude) is small, as can also be seen in the right plot. As we move from left to right, the function becomes steeper, meaning that the magnitude of the gradient increases. Thus the smallest gradient that we see among the iterates of gradient descent in this case is the initial one, and we conclude that we cannot in general claim that the magnitude of the last gradient will decrease (at any rate, or at all).

5.5.2 Reducing the Optimality Gap for Convex Functions

We now discuss how, once we assume that the function is in addition convex, we can obtain results on reducing the optimality gap $f(\mathbf{w}) - f_*.$ Recall that for convex functions a stationary point (a point with zero gradient) must also be a minimizer (in fact, this is an "if and only if" relationship).

To carry out the analysis, we will assume that there exists a minimizer $\mathbf{w}_*$ of $f$ on $\mathbb{R}^d.$ The key idea is to observe how the distance of the iterates $\mathbf{w}_t$ to $\mathbf{w}_*$ (squared) changes over the iterations. Before we delve into the analysis, we recall how to "expand" the square for vectors. Let $\mathbf{a}$ and $\mathbf{b}$ be some vectors in $\mathbb{R}^d.$ Then

$$ \frac{1}{2}\|\mathbf{a} - \mathbf{b}\|_2^2 = \frac{1}{2}\|\mathbf{a}\|_2^2 - \mathbf{a}^\top \mathbf{b} + \frac{1}{2}\|\mathbf{b}\|_2^2. $$

If it is not clear where this comes from, then use the definitions of the Euclidean norm $\frac{1}{2}\|\mathbf{a} - \mathbf{b}\|_2^2 = \frac{1}{2}\sum_{i=1}^d (a_i - b_i)^2$ and the inner product $\mathbf{a}^\top \mathbf{b} = \sum_{i=1}^d a_i b_i$ and expand the square coordinate-wise, for scalars $(a_i - b_i)^2.$

We will analyze gradient descent with constant step size $\eta_t = \eta \in (0, 1/L].$ Recall that in the previous section we showed that in this case we must have the sufficient decrease property:

$$ f(\mathbf{w}_{t+1}) - f(\mathbf{w}_t) \leq - \frac{\eta}{2}\|\nabla f(\mathbf{w}_t)\|_2^2. \;\;(*) $$

Compared to the previous section, we now in addition have that $f$ is convex. What this property crucially allows us to do is to bound below the minimum function value $f(\mathbf{w}_*),$ which in turn enables us to bound the optimality gap, as we will see in a bit. In particular, the convexity of $f$ allows us to say that for any $\mathbf{w}_t$, we have

$$ f(\mathbf{w}_*) \geq f(\mathbf{w}_t) + \nabla f(\mathbf{w}_t)^\top(\mathbf{w}_* - \mathbf{w}_t). \;\;(**) $$

Properties ($*$) and ($**$) are the main things we know about our problem, so in what follows we will crucially use them both to analyze gradient descent.

We begin the analysis by expanding the square $\frac{1}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2 = \frac{1}{2}\|\mathbf{w}_t - \mathbf{w}_* - \eta \nabla f(\mathbf{w}_t)\|_2^2,$ where we recall $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla f(\mathbf{w}_t)$ by the gradient descent update. In particular, we have

$$ \frac{1}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2 = \frac{1}{2}\|\mathbf{w}_t - \mathbf{w}_*\|_2^2 - \eta \nabla f(\mathbf{w}_t)^\top (\mathbf{w}_t - \mathbf{w}_*) + \frac{\eta^2}{2}\|\nabla f(\mathbf{w}_t)\|_2^2. $$

We now use ($*$) and ($**$) to bound above the second and third terms in the above equality. Using ($**$), we have that

$$ - \eta \nabla f(\mathbf{w}_t)^\top (\mathbf{w}_t - \mathbf{w}_*) = \eta \nabla f(\mathbf{w}_t)^\top (\mathbf{w}_* - \mathbf{w}_t) \leq \eta(f(\mathbf{w}_*) - f(\mathbf{w}_t)). $$

On the other hand, a rearrangement of ($*$) gives:

$$ \frac{\eta^2}{2}\|\nabla f(\mathbf{w}_t)\|_2^2 \leq \eta (f(\mathbf{w}_t) - f(\mathbf{w}_{t+1})). $$

Thus, we get that

$$ \frac{1}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2 \leq \frac{1}{2}\|\mathbf{w}_t - \mathbf{w}_*\|_2^2 + \eta(f(\mathbf{w}_*) - f(\mathbf{w}_t)) + \eta (f(\mathbf{w}_t) - f(\mathbf{w}_{t+1})) = \frac{1}{2}\|\mathbf{w}_t - \mathbf{w}_*\|_2^2 + \eta(f(\mathbf{w}_*) - f(\mathbf{w}_{t+1})). $$

A rearrangement of the last inequality now gives:

$$ f(\mathbf{w}_{t+1}) - f(\mathbf{w}_*) \leq \frac{1}{2\eta}\|\mathbf{w}_t - \mathbf{w}_*\|_2^2 - \frac{1}{2\eta}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2. $$

If we sum the above inequality for $t = 0, 1, 2, \dots, T$, we get a telescoping sum on the right-hand side, which leads to

$$ \sum_{t=0}^T (f(\mathbf{w}_{t+1}) - f(\mathbf{w}_*)) \leq \frac{1}{2\eta}\|\mathbf{w}_0 - \mathbf{w}_*\|_2^2 - \frac{1}{2\eta}\|\mathbf{w}_{T+1} - \mathbf{w}_*\|_2^2 \leq \frac{1}{2\eta}\|\mathbf{w}_0 - \mathbf{w}_*\|_2^2, $$

as $\|\mathbf{w}_{T+1} - \mathbf{w}_*\|_2^2 \geq 0.$ On the other hand, gradient descent is a descent method: it reduces the value of the function in each iteration; we also see this from the sufficient descent property ($*$). This means that $f(\mathbf{w}_{T+1}) \leq f(\mathbf{w}_T) \leq \dots \leq f(\mathbf{w}_0)$ and so it must also be that $\sum_{t=0}^T (f(\mathbf{w}_{t+1}) - f(\mathbf{w}_*)) \geq (T+1) (f(\mathbf{w}_{T+1}) - f(\mathbf{w}_*)).$ Combining with the above inequality this leads to

$$ (T+1) (f(\mathbf{w}_{T+1}) - f(\mathbf{w}_*)) \leq \frac{1}{2\eta}\|\mathbf{w}_0 - \mathbf{w}_*\|_2^2, $$

or, equivalently,

$$ f(\mathbf{w}_{T+1}) - f(\mathbf{w}_*) \leq \frac{1}{2\eta (T+1)}\|\mathbf{w}_0 - \mathbf{w}_*\|_2^2. $$

Observe that $\|\mathbf{w}_0 - \mathbf{w}_*\|_2^2$ is a fixed quantity $-$ it only depends on the initialization and intrinsic properties of the problem, and so it does not change with the iterations. In other words, we can view $\frac{1}{2\eta}\|\mathbf{w}_0 - \mathbf{w}_*\|_2^2$ as some fixed number/constant, which further means that $\frac{1}{2\eta (T+1)}\|\mathbf{w}_0 - \mathbf{w}_*\|_2^2$ (and, as a consequence, the optimality gap $f(\mathbf{w}_{T+1}) - f(\mathbf{w}_*)$) reduces at rate $1/(T+1)$ as we iterate the algorithm updates. As a result, for any target error $\epsilon > 0,$ we conclude that for the iterates $\mathbf{w}_t$ of gradient descent we have $f(\mathbf{w}_t) - f(\mathbf{w}_*) \leq \epsilon$ after at most $t = \lceil \frac{1}{2\eta \epsilon}\|\mathbf{w}_0 - \mathbf{w}_*\|_2^2\rceil$ iterations.
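The sketch below checks this bound numerically on an illustrative convex quadratic (an assumption made for the sketch, not a problem from the lecture):

```python
import numpy as np

# Numerical check of the 1/(T+1) optimality-gap bound on an illustrative
# convex quadratic f(w) = 0.5 * ||A w - b||_2^2.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda w: 0.5 * np.sum((A @ w - b) ** 2)
grad_f = lambda w: A.T @ (A @ w - b)

L = np.linalg.eigvalsh(A.T @ A).max()    # smoothness constant
eta = 1.0 / L
w_star = np.linalg.solve(A, b)           # minimizer of this quadratic
w0 = np.zeros(2)

T = 200
w = w0.copy()
for _ in range(T + 1):                   # produce the iterate w_{T+1}
    w = w - eta * grad_f(w)

gap = f(w) - f(w_star)
bound = np.sum((w0 - w_star) ** 2) / (2 * eta * (T + 1))
print(f"optimality gap: {gap:.3e}  <=  bound: {bound:.3e}")
```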

5.6 Stochastic Gradient Method

In learning problems, the objective function is typically expressed as an expectation: we want to solve problems of the form

$$ \min_{\mathbf{w} \in \mathbb{R}^d} f(\mathbf{w}) $$

where $f(\mathbf{w}) = \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[\ell(\mathbf{w}; \mathbf{x}, y)]$, $(\mathbf{x}, y)$ are labeled examples drawn from some (typically unknown) probability distribution $\mathcal{D}$, and $\ell$ is a loss function that evaluates how well the model determined by the weight vector $\mathbf{w}$ fits the data $(\mathbf{x}, y).$

Because the objective function is an expectation, its gradient is also an expectation:

$$ \nabla f(\mathbf{w}) = \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[\nabla \ell(\mathbf{w}; \mathbf{x}, y)]. $$

It is normally unrealistic to assume that we can compute this gradient exactly and use it in our algorithm. Instead, we need to work with gradient estimates. For the learning setting we just discussed, a reasonable estimate would be the gradient of the loss evaluated at a randomly drawn example $(\mathbf{x}, y)$: $\nabla \ell(\mathbf{w}; \mathbf{x}, y)$, where we assume that we draw samples from $\mathcal{D}$ i.i.d.

We will make a fairly general assumption here: instead of exact gradients $\nabla f(\mathbf{w})$, we have access to unbiased gradient estimates $\mathbf{g}_{\mathbf{w}}$, meaning that $\mathbf{g}_{\mathbf{w}}$ is a random vector dependent on $\mathbf{w}$ such that

$$ \mathbb{E}[\mathbf{g}_{\mathbf{w}}] = \nabla f(\mathbf{w}), $$

where the expectation above should be interpreted as conditioned on $\mathbf{w}$: even though $\mathbf{w}$ can be random, we assume that there is an independent source of randomness in $\mathbf{g}_{\mathbf{w}}$ that makes it a random variable that is an unbiased estimate of the gradient. (Think of the learning examples provided at the beginning of this section, where we can take $\mathbf{g}_{\mathbf{w}} = \nabla \ell(\mathbf{w}; \mathbf{x}, y)$ for $(\mathbf{x}, y)$ drawn from $\mathcal{D}$).

This assumption alone is not sufficient for being able to analyze (any) algorithm or even determine if some vector $\mathbf{w}_*$ is approximately optimal (e.g., has small gradient norm or small optimality gap) with finitely many samples. We need, in addition, something that controls how $\mathbf{g}_{\mathbf{w}}$ varies around its mean. For the purpose of this lecture, we will take the norm squared of $\mathbf{g}_{\mathbf{w}}$ to be bounded by some $G^2>0$ in expectation, in the sense that

$$ \mathbb{E}[\|\mathbf{g}_{\mathbf{w}}\|_2^2] \leq G^2, \; \; \forall \mathbf{w} \in \mathbb{R}^d. $$

It is possible to replace this assumption by weaker ones (such as bounded variance, or variance that depends on the distance to the set of optima, etc., though we would then need additional assumptions about $f$), but this does not significantly influence the analysis of the stochastic gradient method, so we stick with this assumption to keep things simple.

The stochastic gradient method looks very similar to gradient descent. It starts from some $\mathbf{w}_0 \in \mathbb{R}^d$ and updates its iterates for $t \geq 0$ as

$$ \mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \mathbf{g}_t, $$

where we use $\mathbf{g}_t$ to compactly denote $\mathbf{g}_{\mathbf{w}_t}$ (and avoid double subscript), as the context is clear.
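As a minimal sketch, the method can be implemented as follows, assuming a synthetic least-squares problem in which each stochastic gradient comes from one freshly drawn example (the data model is an assumption made only for this sketch); it also tracks the averaged iterate that appears in the analysis below:

```python
import numpy as np

# A minimal sketch of the stochastic gradient method on an illustrative
# least-squares objective f(w) = E[(x^T w - y)^2]; the data model below
# (x ~ N(0, I), y = x^T w_true + noise) is an assumption made for this sketch.
rng = np.random.default_rng(0)
d = 5
w_true = rng.standard_normal(d)

def stochastic_gradient(w):
    """Unbiased gradient estimate g_w from a single freshly drawn example (x, y)."""
    x = rng.standard_normal(d)
    y = x @ w_true + 0.1 * rng.standard_normal()
    return 2.0 * (x @ w - y) * x           # gradient of (x^T w - y)^2 at w

T = 10_000
eta = 1.0 / np.sqrt(T + 1)                 # constant step size tied to the horizon
w = np.zeros(d)
w_avg = np.zeros(d)
for _ in range(T + 1):
    w_avg += w / (T + 1)                   # running average of w_0, ..., w_T
    w = w - eta * stochastic_gradient(w)

print("distance of averaged iterate to w_true:", np.linalg.norm(w_avg - w_true))
```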

We will analyze this method not assuming anything about $f$ except that it is convex. In this case, we cannot establish the sufficient descent property and, unlike gradient descent, this method is not guaranteed to decrease the function value even if we had $\mathbf{g}_t = \nabla f(\mathbf{w}_t)$ (i.e., no randomness). As a consequence, step sizes $\eta_t$ need to be kept quite small and, as we will see shortly, the method converges more slowly than what we established for gradient descent earlier in this lecture. To keep things simple, we will consider constant step sizes $\eta_t = \eta,$ which, as we will soon see, will need to depend on the number of steps for which we plan to run the algorithm.

We begin the analysis in the same way as we did for gradient descent: by expanding the square $\frac{1}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2$, where $\mathbf{w}_*$ is a minimizer of $f$. In particular, we have

$$ \frac{1}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2 = \frac{1}{2}\|\mathbf{w}_{t} - \mathbf{w}_*\|_2^2 - \eta \mathbf{g}_t^\top (\mathbf{w}_t - \mathbf{w}_*) + \frac{\eta^2}{2}\|\mathbf{g}_t\|_2^2. $$

We now take the expectation on both sides, where we condition on whatever randomness determines $\mathbf{w}_t$ and denote that expectation by $\mathbb{E}_t$ so that when we condition and take the expectation, we can treat $\mathbf{w}_t$ as being fixed/deterministic. By our assumptions about $\mathbf{g}_t,$ this means that $\mathbb{E}_t[\mathbf{g}_t^\top (\mathbf{w}_t - \mathbf{w}_*)] = \nabla f(\mathbf{w}_t)^\top (\mathbf{w}_t - \mathbf{w}_*)$ and $\mathbb{E}_t[\|\mathbf{g}_t\|_2^2] \leq G^2.$ As a result, we have that

$$ \mathbb{E}_t\Big[\frac{1}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2\Big] \leq \mathbb{E}_t\Big[\frac{1}{2}\|\mathbf{w}_{t} - \mathbf{w}_*\|_2^2\Big] - \eta \nabla f(\mathbf{w}_t)^\top (\mathbf{w}_t - \mathbf{w}_*) + \frac{\eta^2}{2}G^2. $$

By convexity of $f$ (same as what we used in the analysis of gradient descent), we have that $- \eta \nabla f(\mathbf{w}_t)^\top (\mathbf{w}_t - \mathbf{w}_*) \leq \eta(f(\mathbf{w}_*) - f(\mathbf{w}_t)).$ Plugging this into the last inequality and rearranging, we get that

$$ \eta(f(\mathbf{w}_t) - f(\mathbf{w}_*)) \leq \mathbb{E}_t\Big[\frac{1}{2}\|\mathbf{w}_{t} - \mathbf{w}_*\|_2^2\Big] - \mathbb{E}_t\Big[\frac{1}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2\Big] + \frac{\eta^2}{2}G^2. $$

Now, if we directly sum over $t = 0, 1, \dots, T$ as we did in the case of gradient descent, we would not be able to telescope the "squared distance to optimum" terms as we did before, because the conditioning in $\mathbb{E}_t$ depends on $t$. This is resolved by taking the expectation with respect to all randomness in the algorithm on both sides, which gives

$$ \eta\mathbb{E}[f(\mathbf{w}_t) - f(\mathbf{w}_*)] \leq \mathbb{E}\Big[\frac{1}{2}\|\mathbf{w}_{t} - \mathbf{w}_*\|_2^2\Big] - \mathbb{E}\Big[\frac{1}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2\Big] + \frac{\eta^2}{2}G^2. $$

Now when we sum over $t = 0, 1, \dots, T$, we get the desired telescoping, leading to

$$ \eta\sum_{t=0}^T \mathbb{E}[f(\mathbf{w}_t) - f(\mathbf{w}_*)] \leq \mathbb{E}\Big[\frac{1}{2}\|\mathbf{w}_{0} - \mathbf{w}_*\|_2^2\Big] - \mathbb{E}\Big[\frac{1}{2}\|\mathbf{w}_{T+1} - \mathbf{w}_*\|_2^2\Big] + (T+1)\frac{\eta^2}{2}G^2 \leq \frac{1}{2}\|\mathbf{w}_{0} - \mathbf{w}_*\|_2^2 + (T+1)\frac{\eta^2}{2}G^2. $$

Dividing both sides by $\eta(T+1)$ leads to

$$ \frac{1}{T+1}\sum_{t=0}^T \mathbb{E}[f(\mathbf{w}_t) - f(\mathbf{w}_*)]\leq \frac{\|\mathbf{w}_{0} - \mathbf{w}_*\|_2^2}{2\eta(T+1)} + \frac{\eta G^2}{2}. $$

Let $\overline{\mathbf{w}}_T = \frac{1}{T+1}\sum_{t=0}^T \mathbf{w}_t.$ Convexity then implies that $f(\overline{\mathbf{w}}_T) \leq \frac{1}{T+1}\sum_{t=0}^T f(\mathbf{w}_t)$ and so we have

$$ \mathbb{E}[f(\overline{\mathbf{w}}_T) - f(\mathbf{w}_*)] \leq \frac{\|\mathbf{w}_{0} - \mathbf{w}_*\|_2^2}{2\eta(T+1)} + \frac{\eta G^2}{2} $$

and it remains to choose $\eta.$ A reasonable choice would be the $\eta$ that minimizes the right-hand side of the last inequality (so that the optimality gap is as small as possible). Setting the derivative of the right-hand side with respect to $\eta$ to zero gives $\eta = \frac{\|\mathbf{w}_{0} - \mathbf{w}_*\|_2}{G\sqrt{T+1}},$ leading to

$$ \mathbb{E}[f(\overline{\mathbf{w}}_T) - f(\mathbf{w}_*)] \leq \frac{G\|\mathbf{w}_{0} - \mathbf{w}_*\|_2}{\sqrt{T+1}}. $$

This bound is unimprovable: under our problem assumptions, no method can achieve a better bound in the worst case (up to absolute constants). However, this choice of $\eta$ is unrealistic, as we do not normally know the value of $\|\mathbf{w}_{0} - \mathbf{w}_*\|_2$ or even $G.$ If instead we simply choose $\eta = \frac{1}{\sqrt{T+1}},$ we get a slightly worse bound, but one that still goes down as $1/\sqrt{T+1}:$

$$ \mathbb{E}[f(\overline{\mathbf{w}}_T) - f(\mathbf{w}_*)] \leq \frac{\|\mathbf{w}_{0} - \mathbf{w}_*\|_2^2 + G^2}{2\sqrt{T+1}}. $$

Put differently, we get that for any $\epsilon > 0,$ we have $ \mathbb{E}[f(\overline{\mathbf{w}}_T) - f(\mathbf{w}_*)] \leq \epsilon$ after at most $T = \Big\lceil \Big(\frac{\|\mathbf{w}_{0} - \mathbf{w}_*\|_2^2 + G^2}{2\epsilon}\Big)^2 \Big\rceil $ iterations.

Compare this result to what we had for gradient descent. First of all, we have a stochastic method, so the guarantee we get holds for the expected optimality gap (whereas for gradient descent we had a guarantee on the deterministic optimality gap, without the expectation). Further, the dependence on the error parameter $\epsilon$ is increased from $1/\epsilon$ to $1/\epsilon^2,$ which is the price we need to pay in general for working with stochastic estimates of the gradient. Finally, the step size $\eta$ depends on the number of iterations we take, meaning that we need to decide on the number of iterations we want to take in advance. This last requirement is possible to relax by choosing step sizes as $\eta_t = \frac{1}{\sqrt{t+1}}$ and carrying out a similar analysis, which leads to an additional logarithmic factor in the final bound on the number of iterations.

Bibliographic Notes

This lecture is based solely on my experience in teaching CS 726 and materials developed therein. The Python examples were created with the help of ChatGPT and edited to suit this lecture.